When dealing with modern big data sets, a very common theme is reducing the set through a random
process. These generally work by making "many simple estimates" of the full data set, and then judging
them as a whole. Perhaps magically, these "many simple estimates" can provide a very accurate and small
representation of the large data set. The key tool in showing how many of these simple estimates are needed
for a fixed accuracy trade-off is the Chernoff-Hoeffding inequality [2, 6]. This document provides a simple
form of this bound, and two examples of its use.

1 Chernoff-Hoeffding Inequality 

We consider a specific form of the Chernoff-Hoeffding bound. It is not the strongest form of the bound, but
it is asymptotically equivalent for many applications, and it is also fairly straightforward to use.

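Theorem 1.1. Consider a set of $r$ independent random variables $\{X_1, \ldots, X_r\}$. Let $M = \sum_{i=1}^r X_i$. Then

$$\Pr\big[|M - E[M]| > \alpha\big] \le 2\exp\left(\frac{-\alpha^2}{4\sum_{i=1}^r \mathrm{Var}[X_i]}\right).$$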



Two slightly less general variants that are often sufficient follow: 

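Corollary 1.1. Consider a set of $r$ independent random variables $\{X_1, \ldots, X_r\}$. If we know $a_i \le X_i \le b_i$,
then let $\Delta_i = b_i - a_i$. Let $M = \sum_{i=1}^r X_i$. Then

$$\Pr\big[|M - E[M]| > \alpha\big] \le 2\exp\left(\frac{-2\alpha^2}{\sum_{i=1}^r \Delta_i^2}\right).$$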



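Corollary 1.2. Consider a set of $r$ independent identically distributed (iid) random variables $\{X_1, \ldots, X_r\}$
such that $-\Delta \le X_i \le \Delta$ and $E[X_i] = 0$ for each $i \in [r]$. Let $M = \sum_{i=1}^r X_i$ (a sum of the $X_i$s). Then

$$\Pr\big[|M| > \alpha\big] \le 2\exp\left(\frac{-\alpha^2}{2r\Delta^2}\right).$$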
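As a quick sanity check of the bounded-range form of the bound, $2\exp(-2\alpha^2/\sum_i \Delta_i^2)$, the sketch below compares it with the exact two-sided tail of a sum of $r$ fair coin flips (so $a_i = 0$, $b_i = 1$, $\Delta_i = 1$). The coin-flip setting and the function names `hoeffding_bound` and `exact_coin_tail` are illustrative assumptions, not part of the original text.

```python
import math

def hoeffding_bound(alpha, deltas):
    """Bounded-range bound: 2 exp(-2 alpha^2 / sum_i Delta_i^2)."""
    return 2.0 * math.exp(-2.0 * alpha ** 2 / sum(d ** 2 for d in deltas))

def exact_coin_tail(r, alpha):
    """Exact Pr[|M - r/2| > alpha] for M a sum of r fair 0/1 coin flips."""
    total = sum(math.comb(r, m) for m in range(r + 1) if abs(m - r / 2) > alpha)
    return total / 2 ** r

if __name__ == "__main__":
    r = 100
    for alpha in (5, 10, 20):
        # The exact tail should never exceed the bound.
        print(alpha, exact_coin_tail(r, alpha), hoeffding_bound(alpha, [1.0] * r))
```

The bound is loose for small deviations but captures the exponential decay in $\alpha^2 / r$, which is what the later sample-size arguments rely on.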



1.1 The Union Bound 

The Robin to Chernoff-Hoeffding's Batman is the union bound. It shows how to apply this single bound
to many problems at once. It may appear crude, but it can usually only be significantly improved if special
structure is available in the class of problems.

Theorem 1.2. Consider $t$ possibly dependent random events $X_1, \ldots, X_t$. The probability that all events
occur is at least

$$1 - \sum_{i=1}^{t} \big(1 - \Pr[X_i]\big).$$

That is, all events are true if no event is not true.



2 Johnson-Lindenstrauss Lemma 



The first example use is the Johnson-Lindenstrauss Lemma [9]. It describes, in the worst case, how well
distances are preserved under random projections. A random projection $\phi : \mathbb{R}^d \to \mathbb{R}^k$ can be defined by $k$
independent (not necessarily orthogonal) coordinates, each expressed separately as $\phi_i : \mathbb{R}^d \to \mathbb{R}^1$ for $i \in [k]$.
Specifically, $\phi_i$ is associated with an independent random vector $u_i \in S^{d-1}$, that is, a random unit vector in
$\mathbb{R}^d$. Then $\phi_i(p) = \langle p, u_i \rangle$, the inner (aka dot) product between $p$ and the random vector $u_i$.
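The construction of the random projection can be sketched in a few lines. This is a minimal illustration, assuming the random unit vectors are drawn by normalizing Gaussian vectors (a standard way to sample uniformly from the unit sphere); the function names are hypothetical.

```python
import math
import random

def random_unit_vector(d, rng):
    """Uniform random unit vector in R^d, obtained by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def make_projection(d, k, rng):
    """Draw the k independent unit vectors u_1, ..., u_k that define the projection."""
    return [random_unit_vector(d, rng) for _ in range(k)]

def project(p, us):
    """Apply the projection: the i-th output coordinate is the dot product <p, u_i>."""
    return [sum(a * b for a, b in zip(p, u)) for u in us]

if __name__ == "__main__":
    rng = random.Random(0)
    d, k = 50, 8
    us = make_projection(d, k, rng)
    p = [rng.uniform(-1, 1) for _ in range(d)]
    print(len(project(p, us)))  # the projected point has k coordinates
```

Because each coordinate is a dot product with a fixed vector, the map is linear, which is the property the proof uses.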

Theorem 2.1 ([9]). Consider a point set $P \subset \mathbb{R}^d$ of size $n$. Let $Q = \phi(P)$ be a random linear projection
of $P$ to $\mathbb{R}^k$ where $k = (8/\varepsilon^2)\ln(n/\delta)$. Then with probability at least $1 - \delta$, for all $p, p' \in P$ and with
$\varepsilon \in (0, 1/2]$,

$$(1-\varepsilon)\|p - p'\| \le \sqrt{\frac{d}{k}}\,\|\phi(p) - \phi(p')\| \le (1+\varepsilon)\|p - p'\|. \tag{2.1}$$
To prove this we first note that the squared version of $\|\phi(p) - \phi(p')\|$ can be decomposed as follows:

$$\|\phi(p) - \phi(p')\|^2 = \sum_{i=1}^{k} \|\phi_i(p) - \phi_i(p')\|^2.$$

Then since $(1-\varepsilon) \ge (1-\varepsilon)^2$ and $(1+\varepsilon) \le (1+\varepsilon)^2$ for $\varepsilon \in (0, 1/2]$, it is sufficient and simpler to prove

$$1 - \varepsilon \le \frac{d}{k}\,\frac{\|\phi(p) - \phi(p')\|^2}{\|p - p'\|^2} \le 1 + \varepsilon. \tag{2.2}$$

Now we consider the random variable $M = (d/k)\|\phi(p) - \phi(p')\|^2 / \|p - p'\|^2$ as the sum over $k$ random
variables $X_i = (d/k)\|\phi_i(p) - \phi_i(p')\|^2 / \|p - p'\|^2$. Now two simple observations follow:

• $E[X_i] = 1/k$. To see this, for each $u_i$ (independent of other $u_{i'}$, $i \ne i'$) consider a random rotation
of the standard orthogonal basis, restricted only so that one axis is aligned to $u_i$ (which itself was
random). Then, in expectation each axis of this rotated basis contains $1/d$ of the squared norm of any
vector, in particular $(p - p')$. So $E[\langle u_i, p - p' \rangle^2] = (1/d)\|p - p'\|^2$. Then $E[X_i] = 1/k$ follows
from the linearity of $\phi$.

• $\mathrm{Var}[X_i] \le 1/k^2$. Since $\|\phi_i(p) - \phi_i(p')\|^2 \ge 0$, if the variance were larger than $1/k^2$, then the average
distance from $E[X_i] = 1/k$ would be larger than $1/k$, and then the expected value would need to be
larger than $1/k$.

Now plugging these terms into Theorem 1.1 yields (for some parameter $\gamma$)

$$\Pr\big[|M - E[M]| > \alpha\big] \le 2\exp\left(\frac{-\alpha^2}{4k(1/k^2)}\right) \le \gamma,$$

and hence solving for $k$

$$k \ge 4\,\frac{1}{\alpha^2}\ln\frac{2}{\gamma}.$$


Set the middle term in (2.2) to $M$ and note $E[M] = 1$. Now by setting $\alpha = \varepsilon$, it follows that (2.2) is satisfied
with probability $1 - \gamma$ for any one pair $p, p' \in P$ when $k \ge (4/\varepsilon^2)\ln(2/\gamma)$. Since there are $\binom{n}{2} < n^2$
pairs in $P$, by the union bound, setting $\gamma = \delta/n^2$ reveals that $k \ge (8/\varepsilon^2)\ln(n/\delta)$ ensures that all pairs
$p, p' \in P$ satisfy (2.1) with probability at least $1 - \delta$. □
There are several other (often more general) proofs of this theorem [5, 7, 3, 1, 10, 8, 12].
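To get a feel for the target dimension in Theorem 2.1, the following sketch evaluates $k = (8/\varepsilon^2)\ln(n/\delta)$ for a few values of $n$. The function name `jl_dimension` and the rounding up to an integer are illustrative assumptions.

```python
import math

def jl_dimension(n, eps, delta):
    """Target dimension k = (8 / eps^2) ln(n / delta) from Theorem 2.1, rounded up."""
    if not (0 < eps <= 0.5) or not (0 < delta < 1) or n < 1:
        raise ValueError("need eps in (0, 1/2], delta in (0, 1), n >= 1")
    return math.ceil(8.0 / eps ** 2 * math.log(n / delta))

if __name__ == "__main__":
    for n in (10 ** 3, 10 ** 6, 10 ** 9):
        print(n, jl_dimension(n, eps=0.1, delta=0.01))
```

Note that $k$ grows only logarithmically with $n$ and does not depend on the original dimension $d$ at all, while halving $\varepsilon$ roughly quadruples it.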
3 Subset Samples for Density Approximation




Again, consider a set of $n$ points $P \subset \mathbb{R}^d$. Also consider a set $\mathcal{R}$ of
queries we can ask on these points. Herein let each $q \in \mathcal{R}$ correspond
to a $d$-dimensional axis-aligned rectangle $R_q = [a_1, b_1] \times [a_2, b_2] \times \ldots \times [a_d, b_d]$,
and ask how many points in $P$ are in $R_q$. That is,
$q(P) = |P \cap R_q|$. For example, suppose $P$ represents customers of a store with
$d$ attributes (e.g. total number of purchases, average number of purchases
each week, average purchase amount, ...) and $R_q$ is a desired profile
(e.g. has between 100 and 1000 purchases total, averaging between 2.5
and 10 a week, with an average total purchase between 10 and 20 dollars,
...). Then queries return the number of customers who fit that profile.
This pair $(P, \mathcal{R})$ is called a range space.
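A range-counting query on axis-aligned rectangles is simple to state in code. The sketch below is a brute-force illustration; the customer data and the function names are made up for the example.

```python
def in_rectangle(point, rect):
    """True if point lies in the axis-aligned rectangle rect = [(a_1, b_1), ..., (a_d, b_d)]."""
    return all(a <= x <= b for x, (a, b) in zip(point, rect))

def range_count(points, rect):
    """q(P): the number of points of P that fall inside the rectangle."""
    return sum(1 for p in points if in_rectangle(p, rect))

if __name__ == "__main__":
    # Hypothetical customers: (total purchases, purchases per week, average amount).
    customers = [(150, 3.0, 12.5), (90, 4.0, 15.0), (500, 9.5, 19.0), (700, 12.0, 11.0)]
    profile = [(100, 1000), (2.5, 10), (10, 20)]
    print(range_count(customers, profile))
```

Answering the query exactly takes time linear in $n$; the point of this section is that a small random sample answers it approximately for every rectangle at once.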

We now present a weak version of a theorem by Vapnik and Chervonenkis [15] about random sampling
and range spaces.

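Theorem 3.1. Let $S \subset P$ be a random sample from $P$ of size $k = (d/\varepsilon^2)\ln(2n/\delta)$. Then with probability
at least $1 - \delta$, for all $q \in \mathcal{R}$

$$\left|\frac{q(P)}{|P|} - \frac{q(S)}{|S|}\right| \le \varepsilon. \tag{3.1}$$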



The key to this theorem is again the Chernoff-Hoeffding bound. Fix some $q \in \mathcal{R}$, and for each point $s_i$
in $S$, let $X_i$ be a random variable describing the effect on $q(S)$ of $s_i$. That is, $X_i = 1$ if $s_i \in R_q$ and $X_i = 0$ if
$s_i \notin R_q$, so $\Delta_i = 1$ for all $i \in [k]$. Let $M = \sum_i X_i = q(S)$, and note that $E[M] = |S| \cdot q(P)/|P|$.

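Multiplying both sides of the inequality by $k = |S|$, we can now apply Corollary 1.1 to say

$$\Pr\left[\left|\frac{q(S)}{|S|} - \frac{q(P)}{|P|}\right| \ge \varepsilon\right] = \Pr\big[|M - E[M]| \ge \varepsilon k\big] \le 2\exp\left(\frac{-2(\varepsilon k)^2}{\sum_{i=1}^{k} \Delta_i^2}\right) = 2\exp(-2\varepsilon^2 k) \le \gamma.$$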



Solving for $k$ yields that if $k \ge (1/(2\varepsilon^2))\ln(2/\gamma)$, then (3.1) is true with probability at least $1 - \gamma$ for our
fixed $q \in \mathcal{R}$.

To extend this to all possible choices of $q \in \mathcal{R}$ we need to apply the union bound on some bounded
number of possible queries. We can show that there are no more than $n^{2d}$ distinct subsets of $P$ that any
axis-aligned-based query in $\mathcal{R}$ can represent.

To see this, take any rectangle $R$ that contains some subset $T \subseteq P$ of the points in $P$. Shrink this
rectangle along each coordinate until no interval can be made smaller without changing the subset of points
it contains. At this point $R$ will touch at most $2d$ points, two for each dimension (if one side happens to
touch two points simultaneously, this only lowers the number of possible subsets). Any rectangle can thus
be mapped to one of at most $n^{2d}$ rectangles without changing which points it contains, where this canonical
rectangle (and importantly its subset of points) is described by this subset of $2d$ points.
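The shrinking step can be made concrete: shrinking a rectangle until every side touches a point yields the bounding box of the points it contains. The sketch below assumes a non-degenerate input and uses a hypothetical function name.

```python
def shrink_to_canonical(points, rect):
    """Shrink rect to the bounding box of the points it contains.

    Returns None if the rectangle contains no points; otherwise the shrunken
    rectangle, which contains exactly the same subset of points.
    """
    inside = [p for p in points if all(a <= x <= b for x, (a, b) in zip(p, rect))]
    if not inside:
        return None
    d = len(rect)
    # Each side is pulled in until it touches the extreme point in that coordinate.
    return [(min(p[j] for p in inside), max(p[j] for p in inside)) for j in range(d)]

if __name__ == "__main__":
    pts = [(1, 1), (2, 5), (4, 3), (9, 9)]
    print(shrink_to_canonical(pts, [(0, 5), (0, 6)]))
```

Each of the $2d$ sides of the result is determined by one of the $n$ points, which is where the count of at most $n^{2d}$ canonical rectangles comes from.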

Since the application of the Chernoff-Hoeffding bound above does not change if the subset defined by $R_q$
does not change, to prove Theorem 3.1 we need to show (3.1) holds for only $n^{2d}$ different subsets. Setting
$\gamma = \delta/n^{2d}$ and applying the union bound (Theorem 1.2) indicates that $k \ge (d/\varepsilon^2)\ln(2n/\delta)$ random samples
is sufficient. □
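The sample size from Theorem 3.1 and the resulting estimate can be tried out directly. This sketch assumes sampling with replacement, so that the indicator variables are independent as the proof requires; the function names are hypothetical.

```python
import math
import random

def sample_size(n, d, eps, delta):
    """k = (d / eps^2) ln(2n / delta) from Theorem 3.1, rounded up."""
    return math.ceil(d / eps ** 2 * math.log(2 * n / delta))

def estimate_fraction(points, rect, k, rng):
    """Estimate q(P)/|P| by q(S)/|S| for a random sample S of size k."""
    sample = [rng.choice(points) for _ in range(k)]  # assumption: with replacement
    hits = sum(1 for p in sample if all(a <= x <= b for x, (a, b) in zip(p, rect)))
    return hits / k

if __name__ == "__main__":
    rng = random.Random(0)
    points = [(rng.random(), rng.random()) for _ in range(10000)]
    rect = [(0.2, 0.7), (0.1, 0.5)]
    k = sample_size(len(points), d=2, eps=0.1, delta=0.01)
    exact = sum(1 for p in points if all(a <= x <= b for x, (a, b) in zip(p, rect))) / len(points)
    print(k, exact, estimate_fraction(points, rect, k, rng))
```

In practice the observed error is usually far below $\varepsilon$ for any single rectangle, since the bound has to cover all $n^{2d}$ canonical rectangles simultaneously.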



Extensions: 

• Amazingly, Vapnik and Chervonenkis [15] proved an even stronger result that only $k = O((d/\varepsilon^2)\log(1/(\varepsilon\delta)))$
random samples are needed. Note, this has no dependence on $n$, the number of points! Moreover,
Talagrand [14], as reported by Li, Long, and Srinivasan [11], improved this further to
$k = O((1/\varepsilon^2)(d + \log(1/\delta)))$. So basically the number of samples needed to guarantee any one query has
at most $\varepsilon$-error is sufficient to guarantee the same result for all queries!

• This generalizes naturally to other types of range queries, using the idea of VC-dimension $\nu$, where
the bound is then $k = O((1/\varepsilon^2)(\nu + \log(1/\delta)))$. For axis-aligned rectangles $\nu = 2d$, for balls it is
$\nu = d + 1$, and for half spaces it is $\nu = d + 1$. This last bound for half spaces is particularly important
for understanding how many samples are needed for determining approximate (linear) classifiers for
machine learning.

• These bounds hold if P is a continuous distribution (in some sense it has an infinite number of points). 



4 Delayed Proofs 

Here we prove Theorem 1.1. These proofs are inspired by [4] and associatively [13].

Markov inequality. Consider a random variable $X$ such that all possible values of $X$ are non-negative,
then

$$\Pr[X > \alpha] \le \frac{E[X]}{\alpha}.$$

To see this, consider if it was not true, and $\Pr[X > \alpha] > E[X]/\alpha$. Let $\gamma = \Pr[X > \alpha]$. Then, since $X \ge 0$,
we need to make sure the expected value of $X$ does not get too large. So, let the instances of $X$ from the
probability distribution of its values which are at most $\alpha$ be as small as possible, namely 0. Then we
can still reach a contradiction:

$$E[X] \ge (1 - \gamma) \cdot 0 + \gamma \cdot \alpha = \gamma\alpha > \frac{E[X]}{\alpha}\,\alpha = E[X].$$
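A small numerical check of the Markov inequality on a discrete distribution may help. The fair-die example and the function name `markov_check` are illustrative assumptions.

```python
def markov_check(values, probs, alpha):
    """Return (Pr[X > alpha], E[X] / alpha) for a non-negative discrete X."""
    assert all(v >= 0 for v in values), "the Markov inequality needs X >= 0"
    expectation = sum(v * p for v, p in zip(values, probs))
    tail = sum(p for v, p in zip(values, probs) if v > alpha)
    return tail, expectation / alpha

if __name__ == "__main__":
    # A fair six-sided die: the tail probability should not exceed E[X] / alpha.
    tail, bound = markov_check([1, 2, 3, 4, 5, 6], [1 / 6] * 6, alpha=4)
    print(tail, bound)
```

The bound is often far from tight, which is why the proof below applies it to an exponential of the sum rather than to the sum itself.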

Exponential inequalities. We state two simple facts about natural exponentials $e^x = \exp(x)$. They can
easily be verified by plotting.

• $e^x \le 1 + x + x^2$ for $0 \le |x| \le 1$.

• $e^x \ge 1 + x$.
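Instead of plotting, the two facts can also be checked numerically on a grid. This is only a spot check under the stated grid and tolerance (both arbitrary choices), not a proof.

```python
import math

def check_exponential_facts(steps=2001):
    """Check e^x <= 1 + x + x^2 on [-1, 1] and e^x >= 1 + x on [-5, 5] over a grid."""
    tol = 1e-12  # slack for floating-point rounding near x = 0, where both sides meet
    first = all(
        math.exp(x) <= 1 + x + x * x + tol
        for x in (-1 + 2 * i / (steps - 1) for i in range(steps))
    )
    second = all(
        math.exp(x) >= 1 + x - tol
        for x in (-5 + 10 * i / (steps - 1) for i in range(steps))
    )
    return first, second

if __name__ == "__main__":
    print(check_exponential_facts())
```

The restriction $|x| \le 1$ in the first fact matters: at $x = 2$ the left side is $e^2 \approx 7.39$ while the right side is 7.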



Proof. We will prove the one-sided condition below. The other side is symmetric, and the two-sided 
version follows from the union bound. 

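$$\Pr\big[M - E[M] > \alpha\big] \le \exp\left(\frac{-\alpha^2}{4\,\mathrm{Var}[M]}\right). \tag{4.1}$$

We can rewrite the left hand side as (with $\lambda > 0$, and eventually we set $\lambda = \alpha/(2\,\mathrm{Var}[M])$)

$$
\begin{aligned}
\Pr\Big[\sum_i (X_i - E[X_i]) > \alpha\Big]
&= \Pr\Big[\exp\Big(\lambda \sum_i (X_i - E[X_i])\Big) > \exp(\lambda\alpha)\Big] \\
&\le \frac{E\big[\exp\big(\lambda \sum_i (X_i - E[X_i])\big)\big]}{\exp(\lambda\alpha)} \\
&= \frac{1}{\exp(\lambda\alpha)} \prod_i E\big[\exp(\lambda(X_i - E[X_i]))\big] \\
&\le \frac{1}{\exp(\lambda\alpha)} \prod_i E\big[1 + \lambda(X_i - E[X_i]) + \lambda^2 (X_i - E[X_i])^2\big] \\
&= \frac{1}{\exp(\lambda\alpha)} \prod_i \big(1 + \lambda^2\,\mathrm{Var}[X_i]\big) \\
&\le \frac{1}{\exp(\lambda\alpha)} \prod_i \exp\big(\lambda^2\,\mathrm{Var}[X_i]\big) = \frac{\exp(\lambda^2\,\mathrm{Var}[M])}{\exp(\lambda\alpha)} \\
&= \exp\left(\frac{-\alpha^2}{4\,\mathrm{Var}[M]}\right) = \exp\left(\frac{-\alpha^2}{4\sum_i \mathrm{Var}[X_i]}\right).
\end{aligned}
$$

This holds for $\alpha \le 2\,\mathrm{Var}[M]/(\max_i |X_i - E[X_i]|)$. The three inequalities in the above derivation are the
same three presented above, respectively (i.e. the second line follows from the Markov inequality). □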



Corollaries. We now prove Corollary 1.1 and Corollary 1.2 up to constant factors in the exponent.
Direct proofs can achieve the stated bounds, but for brevity we only show these less repetitive reductions.

To see Corollary 1.1 we need to bound the variance of random variables in an interval $[a_i, b_i]$ with
$\Delta_i = b_i - a_i$. The variance is largest when $E[X_i] = (a_i + b_i)/2$ and $X_i$ takes only positions $a_i$ or $b_i$ with equal
probability, yielding $\mathrm{Var}[X_i] = \Delta_i^2/4$. Plugging this result into Theorem 1.1 results in

$$\Pr\big[|M - E[M]| > \alpha\big] \le 2\exp\left(\frac{-\alpha^2}{\sum_i \Delta_i^2}\right).$$

To see Corollary 1.2 from Corollary 1.1 set each $\Delta_i = 2\Delta$ and $E[M] = 0$. □



4.1 On Independence and the Union Bound 

The proof of the union bound is an elementary observation. Here we state a perhaps amazing fact that this
seemingly crude bound is fairly tight even if the events are independent. Let $\Pr[X_i] = 1 - \gamma$ for $i \in [t]$.
The union bound says the probability all events occur is at least $1 - t\gamma$. So to achieve a total of at most $\delta$
probability of failure, we need $\gamma \le \delta/t$.

On the other hand, by independence, we can state the probability of all events is $(1 - \gamma)^t$. By the
approximation for large $s$ that $(1 - x/s)^s \approx e^{-x}$ we can approximate $(1 - \gamma)^t \approx e^{-\gamma t}$. So to achieve a total
of at most $\delta$ probability of failure, we need $1 - \delta \le e^{-\gamma t}$, which after some algebraic manipulation reveals
$\gamma \le \ln(1/(1 - \delta))/t$.

So for $\delta$ small enough (say $\delta = 1/100$, then $\ln(1/(1 - \delta)) = 0.01005\ldots$) the terms $\delta/t$ and
$\ln(1/(1 - \delta))/t$ are virtually the same. The only way to dramatically improve this is to show that the events are
strongly negatively dependent, as for instance is done in the proofs by Vapnik and Chervonenkis [15] and
Talagrand [14].
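The comparison between the two per-event failure budgets is easy to reproduce numerically. The function names below are hypothetical; the second function uses the same exponential approximation as the text.

```python
import math

def per_event_failure_union(delta, t):
    """Largest per-event failure probability allowed by the union bound: delta / t."""
    return delta / t

def per_event_failure_independent(delta, t):
    """Per-event failure under independence, via the approximation: ln(1 / (1 - delta)) / t."""
    return math.log(1.0 / (1.0 - delta)) / t

if __name__ == "__main__":
    delta, t = 0.01, 1000
    union = per_event_failure_union(delta, t)
    indep = per_event_failure_independent(delta, t)
    print(union, indep, indep / union)
```

For small total failure probability the ratio of the two budgets stays very close to 1 regardless of the number of events, so knowing the events are independent buys almost nothing.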